The Challenge

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: "what sorts of people were more likely to survive?" using passenger data (i.e., name, age, gender, socio-economic class, etc.). (Source: https://www.kaggle.com/competitions/titanic)

Brief Introduction

This notebook uses stacking to predict whether the passengers in the test set survived. We include a total of 5 models in the stack, perform hyperparameter tuning for each, and report feature importances. Before building the predictive model, we do a brief EDA followed by feature selection and feature engineering.

Table of Contents

EDA

We have different numerical and categorical features, some of which have missing values:

Some interesting questions for EDA:

Overall, the survival rate seems to be similar across age groups, with the exception of young children aged 0 to 10.

The passengers in the groups with higher Fares seem to have a higher survival rate than those with the lowest Fares.
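The fare comparison above can be sketched with a quantile binning and a group-by; the fares and outcomes below are made-up toy values, only the pattern matters:

```python
import pandas as pd

# Toy stand-in for the Titanic data: bin fares into terciles and
# compare the mean survival rate per bin (values are illustrative).
df = pd.DataFrame({
    "Fare": [7.25, 8.05, 26.0, 53.1, 71.3, 512.3],
    "Survived": [0, 0, 1, 1, 1, 1],
})
df["FareBand"] = pd.qcut(df["Fare"], 3)

# Mean survival rate in each fare band, lowest band first.
rates = df.groupby("FareBand", observed=True)["Survived"].mean()
print(rates.tolist())  # [0.0, 1.0, 1.0]
```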

At first sight, most titles seem to relate to gender, age, or profession. Some appear with very low frequency, such as Don, Capt, or Lady, while others are very common: Master, Mrs, Miss & Mr. We also observe that survivability varies greatly among the title groups, with titles pointing to females and to higher socioeconomic status showing higher survival rates.
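Titles like these are typically pulled out of the `Name` column, which follows the pattern "Last, Title. First". A minimal sketch of that extraction (the sample names and the rare-title grouping threshold are illustrative, not the notebook's exact rules):

```python
import pandas as pd

# Illustrative sample of Titanic-style names.
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley",
        "Heikkinen, Miss. Laina",
        "Allen, Master. Hudson",
    ]
})

# Capture the token between the comma and the first period.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False)

# Bucket low-frequency titles into a single 'Rare' category.
common = {"Mr", "Mrs", "Miss", "Master"}
df["Title"] = df["Title"].where(df["Title"].isin(common), "Rare")
print(df["Title"].tolist())  # ['Mr', 'Mrs', 'Miss', 'Master']
```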

Gender correlates the most with survivability, followed by Fare and Passenger Class.
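A correlation ranking like this can be read straight off the correlation matrix of the numerically encoded frame; a sketch with a tiny hand-made stand-in (values are illustrative only):

```python
import pandas as pd

# Toy numerically encoded frame; column names follow the notebook.
df = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 1, 0],
    "Sex":      [1, 0, 1, 0, 1, 0],   # 1 = female, 0 = male
    "Fare":     [71.3, 7.25, 53.1, 8.05, 26.0, 7.9],
    "Pclass":   [1, 3, 1, 3, 2, 3],
})

# Absolute Pearson correlation of each feature with survival,
# strongest first.
corr = df.corr()["Survived"].drop("Survived").abs().sort_values(ascending=False)
print(corr.index.tolist())
```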

We have a classification problem to handle: we want to predict whether a passenger survived based on their features.

We'll run a simple logistic regression, but first we should do a bit of data engineering in order to prepare the data.
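The baseline step can be sketched as a scikit-learn pipeline that scales the features and fits a logistic regression; the random data below stands in for the prepared Titanic features:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Random stand-in data: 100 passengers, 4 numeric features, with a
# label that depends linearly on the first two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Scale, then classify, in one pipeline.
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X, y)
print(round(clf.score(X, y), 2))
```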

Data Preprocessing

From the correlation matrix we also know 'SibSp' and 'Parch' have little correlation with survivability, whereas the variable we derived from them ('Company') shows a fairly high correlation. To avoid a multicollinearity issue, we drop both 'SibSp' and 'Parch'.
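A minimal sketch of that derivation, assuming 'Company' is simply the total number of relatives aboard (siblings/spouses plus parents/children):

```python
import pandas as pd

# Illustrative values for the two relative-count columns.
df = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Derive the combined feature, then drop the correlated sources.
df["Company"] = df["SibSp"] + df["Parch"]
df = df.drop(columns=["SibSp", "Parch"])
print(df["Company"].tolist())  # [1, 0, 5]
```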

ML models

Logistic Regression

Random Forest

Support Vector Machine

KNN

Gradient Boosting
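The hyperparameter tuning step for each of the five models can be sketched with a cross-validated grid search; the Random Forest grid below uses placeholder values, not the ones tuned in the notebook:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Random stand-in data for the prepared features.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 4))
y = (X[:, 0] > 0).astype(int)

# 3-fold cross-validated search over a small illustrative grid.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```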

As seen from the feature importances we plotted for each of the 5 models, Fare, Title, Age, and Gender are usually the most important, while embarkation location seems to be irrelevant in most cases. Nevertheless, we proceed without changing any more features for our final model.
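For the tree-based models, importances like these come from the fitted model's `feature_importances_` attribute; a sketch with made-up data and illustrative feature names (the label below depends only on the first feature, so it should dominate):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in data where only the first feature drives the label.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(random_state=0).fit(X, y)

# Pair each (hypothetical) feature name with its importance.
for name, imp in zip(["Fare", "Title", "Age"], model.feature_importances_):
    print(f"{name}: {imp:.2f}")
```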

Stacking

By stacking the models above, we combine their predictions. Although the stacked model achieves a score similar to that of the best individual model, we go ahead and use the stacking classifier to predict the survival of the passengers in the test data, since stacking also helps guard against overfitting to any single model.
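The stacking step can be sketched with scikit-learn's `StackingClassifier`, combining the five base models under a logistic-regression meta-learner; the base models use default hyperparameters here, not the tuned ones, and the data is a random stand-in:

```python
import numpy as np
from sklearn.ensemble import (StackingClassifier, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Random stand-in data for the prepared training features.
rng = np.random.default_rng(3)
X = rng.normal(size=(150, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Five base learners; their cross-validated predictions feed a
# logistic-regression meta-learner.
stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression()),
        ("rf", RandomForestClassifier(random_state=0)),
        ("svm", SVC(random_state=0)),
        ("knn", KNeighborsClassifier()),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=3,
)
stack.fit(X, y)
print(stack.predict(X).shape)
```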

Predicting